fix(mcp): reap serve --mcp child when parent is SIGKILL'd (#277) #286

Merged
colbymchenry merged 3 commits into colbymchenry:main from evanclan:fix/mcp-ppid-watchdog-277
May 22, 2026

@evanclan
Contributor

Summary

Adds a process.ppid watchdog to MCPServer.start() so a codegraph serve --mcp child terminates when its MCP host is force-killed. Resolves #277.

Problem

The existing shutdown path (src/mcp/index.ts) leans entirely on signal handlers and stdin close events:

```75:79:src/mcp/index.ts
process.on('SIGINT', () => this.stop());
process.on('SIGTERM', () => this.stop());

// When the parent process (Claude Code) exits, stdin closes.
// Detect this and shut down gracefully to prevent orphaned processes.

```

On Linux that's not enough when the host (Claude Code, opencode, …) is SIGKILL'd by the OOM killer / a `kill -9` / a container teardown:

  • The kernel does not propagate parent death to children.
  • The child gets reparented to init/systemd.
  • Whether the half-closed stdio pipe surfaces `end` / `close` to Node depends on whether anyone else still holds the write end. In the reporter's environment it didn't fire — three orphan `codegraph serve --mcp` processes were pinned across sessions, each holding ~440k inotify watches (which then transitively tripped #276's watch-budget exhaustion in Next.js / IDEs).

Solution

Capture `process.ppid` once at construction, then poll it on a `setInterval` (default 5s, `.unref()`'d). The moment it diverges from the baseline, we know the original parent has died and we tear down cleanly:

```text
[CodeGraph MCP] Parent process exited (ppid 9177 -> 1); shutting down.
```
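
As a rough illustration of the polling approach described above, here is a minimal sketch. It is not the code from this PR: `createPpidWatchdog`, its `check`/`start`/`stop` methods, and the injectable `getPpid` parameter are hypothetical names introduced so the logic can be exercised without killing a real parent.

```typescript
// Hypothetical sketch of a parent-PID watchdog; names are illustrative only.
interface PpidWatchdog {
  check(): boolean; // true once the parent PID has diverged from the baseline
  start(): void;
  stop(): void;
}

function createPpidWatchdog(
  onParentExit: (baseline: number, current: number) => void,
  pollMs: number = 5000,
  getPpid: () => number = () => process.ppid, // injectable for tests (assumption)
): PpidWatchdog {
  // Capture the parent PID once, before any reparenting can happen.
  const baseline = getPpid();
  let timer: ReturnType<typeof setInterval> | undefined;

  const stop = (): void => {
    if (timer !== undefined) {
      clearInterval(timer);
      timer = undefined;
    }
  };

  const check = (): boolean => {
    const current = getPpid();
    if (current === baseline) return false;
    // The PID changed, so the original parent is gone: stop polling and notify.
    stop();
    onParentExit(baseline, current);
    return true;
  };

  const start = (): void => {
    // A non-positive interval disables the watchdog.
    if (pollMs <= 0 || timer !== undefined) return;
    timer = setInterval(check, pollMs);
    // unref() so the timer alone never keeps the event loop alive.
    timer.unref();
  };

  return { check, start, stop };
}

// Example wiring (commented out so the sketch has no side effects):
// const watchdog = createPpidWatchdog((from, to) => {
//   console.error(`[CodeGraph MCP] Parent process exited (ppid ${from} -> ${to}); shutting down.`);
// });
// watchdog.start();
```

Injecting `getPpid` is purely a testing convenience in this sketch; production code would read `process.ppid` directly.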

Cross-platform: reparenting changes `process.ppid` on Linux and macOS; on Windows the value drops to 0 once the parent is gone, which also trips the check.

Knobs:

  • `CODEGRAPH_PPID_POLL_MS` env var overrides the poll interval (default 5000ms).
  • `CODEGRAPH_PPID_POLL_MS=0` disables the watchdog entirely — escape hatch for embedded scenarios where the parent deliberately re-parents the server.
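
One way the interval override might be parsed is sketched below. The helper name `resolvePollMs` is hypothetical. Falling back to the default on malformed or negative input is an assumption, since the PR does not say how such values are handled.

```typescript
// Hypothetical helper: turn the CODEGRAPH_PPID_POLL_MS value into a poll interval.
const DEFAULT_PPID_POLL_MS = 5000;

function resolvePollMs(raw: string | undefined): number {
  // Unset or empty: use the documented default.
  if (raw === undefined || raw.trim() === "") return DEFAULT_PPID_POLL_MS;
  const parsed = Number(raw);
  // Assumption: malformed or negative values fall back to the default.
  if (!Number.isFinite(parsed) || parsed < 0) return DEFAULT_PPID_POLL_MS;
  // 0 is passed through unchanged and means "watchdog disabled".
  return Math.floor(parsed);
}

// Example: const pollMs = resolvePollMs(process.env.CODEGRAPH_PPID_POLL_MS);
```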

`stop()` is now guarded by an idempotency flag so the watchdog can't race the existing stdin-close handlers and double-close the SQLite handle / transport.
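
The idempotency guard can be pictured with the following minimal sketch. `StoppableServer` and its `closeResources` callback are hypothetical stand-ins for the real `MCPServer` and its SQLite/transport teardown, not the PR's code.

```typescript
// Hypothetical sketch of an idempotent stop(); not the actual MCPServer class.
class StoppableServer {
  private stopped = false;
  private closeResources: () => void;

  constructor(closeResources: () => void) {
    this.closeResources = closeResources;
  }

  // Returns true only for the call that actually performed the teardown.
  stop(): boolean {
    if (this.stopped) return false; // a second caller (watchdog vs. stdin close) is a no-op
    this.stopped = true; // set the flag before teardown so re-entrant calls bail out
    this.closeResources();
    return true;
  }
}
```

Setting the flag before running the teardown means a handler that fires during teardown also sees the server as already stopped.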

Out of scope

  • The companion inotify-watch exhaustion (#276) — this PR removes the orphan-server multiplier on the watch budget, but the underlying `fs.watch({ recursive: true })` registering-everything behavior is its own change.
  • `prctl(PR_SET_PDEATHSIG)` (the issue's suggested fix #2): pure-JS polling is enough for a failure mode that's already rare, so no native build dependency is needed.

Test plan

  • New `tests/mcp-ppid-watchdog.test.ts` stands up a four-tier process tree (vitest → wrapper → {stdin-holder, codegraph}) and SIGKILL's the wrapper. The stdin-holder is a long-lived sibling whose `stdout` pipe is dup'd into codegraph's `stdin`, so the wrapper's death does not transitively close codegraph's stdin. That isolates the watchdog from the pre-existing stdin-close path — confirmed by temporarily disabling the watchdog (test fails) vs leaving it in (test passes).
  • 5/5 consecutive local runs at ~925ms each, no flakes.
  • `npx vitest run tests/mcp-ppid-watchdog.test.ts tests/mcp-initialize.test.ts tests/mcp-roots.test.ts` — 6/6 passing.
  • Full `npm test`: 718 passing, 5 failing in `git-hooks.test.ts` / `watcher.test.ts` — both pre-existing on `main` (verified by re-running on `origin/main` HEAD) and unrelated to this change.
  • `npx tsc --noEmit` clean.

Related

Made with Cursor

evanclan and others added 3 commits May 22, 2026 08:48
…ry#277)

On Linux the kernel doesn't propagate parent death to children, and the
existing `stdin.on('end' | 'close')` handlers don't always fire when an
MCP host (Claude Code, opencode, …) is force-killed by the OOM killer,
a `kill -9`, or a container teardown. The reporter in colbymchenry#277 ended up
with three orphan `codegraph serve --mcp` processes pinned across
sessions, each holding its own inotify watch set (~440k watches),
which then tripped colbymchenry#276's watch-budget exhaustion in unrelated tools
(Next.js, IDEs).

Capture `process.ppid` once at server construction, poll it on a
`setInterval`, and shut down the moment it diverges from that baseline.
The interval is `.unref()`'d so it never holds the event loop open on
its own; the poll period is `CODEGRAPH_PPID_POLL_MS` (default 5000ms,
`0` disables for embedded hosts that re-parent on purpose).

The regression test stands up a four-tier process tree
(vitest → wrapper → {stdin-holder, codegraph}) so the wrapper's
SIGKILL doesn't transitively close codegraph's stdin (sibling
stdin-holder keeps the pipe's write-end alive). That isolates the
watchdog from the pre-existing stdin-close path: the test fails
without the watchdog and passes with it.

The watchdog added for colbymchenry#277 watched only process.ppid. On current main,
the --liftoff-only re-exec inserts an intermediate process between the MCP
host and the server; that intermediate outlives the host (blocked in
spawnSync), so the server's own ppid never changes when the host dies and
the watchdog never fires — the regression test fails on main.

Propagate the host PID across the re-exec (CODEGRAPH_HOST_PPID) and have
the watchdog poll it for liveness, keeping the ppid-divergence check for
the direct (bundled) launch path. Validated: watchdog test passes with
re-exec active, and an A/B (watchdog on vs off) reaps the orphan only with
it on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@colbymchenry colbymchenry merged commit fb45959 into colbymchenry:main May 22, 2026
@colbymchenry
Owner

Merged — thank you @evanclan! 🎉 Genuinely one of the best-documented PRs I've reviewed: the four-tier process-tree test (dup'ing a sibling's stdout into codegraph's stdin so SIGKILL'ing the wrapper doesn't close stdin) isolates the watchdog perfectly.

One thing surfaced during validation that needed a fix on your branch before merge: main moved under the PR. Since you opened it, main gained a --liftoff-only re-exec (the V8 turboshaft WASM workaround), which inserts an intermediate process between the MCP host and the server. That intermediate outlives the host (it's blocked in spawnSync), so the server's own process.ppid never changes when the host dies — and the watchdog never fired. Your regression test caught it cleanly: it failed on the merged state.

Fix (commit on your branch, credited to you via squash): propagate the host PID across the re-exec via CODEGRAPH_HOST_PPID, and have the watchdog poll that PID for liveness in addition to the process.ppid divergence check (which still covers the bundled/direct-launch path). Your test now passes through the re-exec, so it guards both paths.
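
A rough sketch of how the two checks could be combined is shown below. Only the `CODEGRAPH_HOST_PPID` variable name comes from this thread. The helper names are hypothetical, and using `process.kill(pid, 0)` as the liveness probe is an assumption about the implementation, not something the commit confirms.

```typescript
// Hypothetical helpers; the liveness probe via signal 0 is an assumption.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0); // signal 0 tests for existence without delivering a signal
    return true;
  } catch (err) {
    // EPERM means the process exists but we may not signal it; treat it as alive.
    return (err as { code?: string }).code === "EPERM";
  }
}

function resolveHostPid(env: Record<string, string | undefined>): number | undefined {
  const raw = env.CODEGRAPH_HOST_PPID;
  if (raw === undefined) return undefined;
  const pid = Number.parseInt(raw, 10);
  return Number.isInteger(pid) && pid > 0 ? pid : undefined;
}

function parentIsGone(
  baselinePpid: number,
  currentPpid: number,
  hostPid: number | undefined,
  alive: (pid: number) => boolean = isProcessAlive,
): boolean {
  // Direct (bundled) launch path: our own parent PID changed.
  if (currentPpid !== baselinePpid) return true;
  // Re-exec path: our parent is the intermediate process, so probe the host itself.
  return hostPid !== undefined && !alive(hostPid);
}
```

In this sketch the re-exec wrapper would be expected to set `CODEGRAPH_HOST_PPID` to its own parent's PID before spawning the server, which is how the description above reads.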

Validation on macOS (reparent-to-launchd mirrors the Linux case):

  • Your test: green on the merged state.
  • Independent A/B (same binary, watchdog on vs off): orphan reaped only with it on; survives 4s+ with it off — confirming the watchdog is precisely what fixes it.
  • Full suite: 776 passing, 0 failing. (The git-hooks/watcher failures you noted were environment-specific — they pass on the merged state.)

Thanks again — great contribution. 🙏

Development

Successfully merging this pull request may close these issues.

  • serve --mcp is not reaped when the parent Claude Code process is SIGKILL'd (Linux)
  • git-hook potential issue when codegraph is not installed globally
